DATA.DEF estimates distribution characteristics for data sets of any size having up to 100 sample points. Twenty-seven statistics (see below for detail) are computed and used to classify distributions into one of 24 cells of Tail Weight and Symmetry/Asymmetry as defined in Micceri (1989). A minimum of 20 observations is required to properly compute all statistics. The program has been developed to work on three platforms (Macintosh - Excel or any spreadsheet capable of working with Excel files; MS-DOS compatibles - Lotus 1-2-3, Excel or any application capable of using these files; Amiga - Advantage and CrossDOS). If you received the incorrect form, please send me a message.
*************************
DATA.DEF is SHAREWARE
*************************
Please pass it on. If you like it, use it for more than one month, or want updates and revisions, please send between $5 and $15 to the starving author and his big, voracious family:
Theodore Micceri
527 Lantern Circle
Temple Terrace, FL 33617
(813) 988-0056
FAX (813) 974-3493
CIS 70303,1275
Also, please send comments or any improvements you make. Note that I avoided Macros and other simplifying techniques to allow easier transfer across platforms.
*************************
DOCUMENTATION
*************************
Operating on the assumption that the nature of score distributions influences appropriate statistics, statistical findings and the interpretations allowable for different data sets, this program was written to enable comprehensive description of data sets and their classification into categories that allow one to estimate how robust various statistics will be when applied to those data. Authors too numerous to mention have noted that most statistics are influenced by the distributional characteristics of the data to which they are applied. However, most of these authors investigated data sets falling beyond Tail Weight Category 6 (Double Exponential) or at or beyond Symmetry Category 4 (Exponential). In fact, most data sets in all fields do not fall into those categories. My research and evaluation of the robustness literature suggest that distributions falling in asymmetry categories 1 or 2 and below Tail Weight category 6 should not prove too damaging to OLS-based statistics. Even distributions falling into asymmetry category 3 are often not very dangerous. However, one should be quite careful about interpretation for any distributions falling outside of Symmetry Cell 1 and Tail Weight Cell 3 (About Gaussian).
Four levels of symmetry/asymmetry and six tail-weight categories are defined as:
By crossing these categories, twenty-four cells are defined, ranging from (1,1), relatively symmetric with uniform tail weight, to (4,6), exponential or greater tail weight and asymmetry.
For both tail-weight and asymmetry, the “moderate” and “extreme” contaminations are defined by mixed normal distributions where moderate contamination (5%, +- 2 std dev) represents at least twice the expected number of observations more than two standard deviations from the mean, and extreme contamination (15%, +- 3 std dev) represents more than 100 times the expected number of observations more than three standard deviations from the mean.
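The multipliers quoted above can be checked against Gaussian tail areas. The sketch below assumes, since the text does not say so, that each contamination percentage is compared with the one-tailed Gaussian expectation. The function name is illustrative only and is not part of DATA.DEF.

```python
from statistics import NormalDist

def gaussian_upper_tail(k):
    """Proportion of a Gaussian distribution lying more than k SD above the mean."""
    return 1.0 - NormalDist().cdf(k)

# Assumption: a one-tailed comparison (not stated in the text).
moderate_ratio = 0.05 / gaussian_upper_tail(2)  # 5% against the expected tail area
extreme_ratio = 0.15 / gaussian_upper_tail(3)   # 15% against the expected tail area

print(f"moderate: {moderate_ratio:.1f} times the expected proportion")
print(f"extreme:  {extreme_ratio:.0f} times the expected proportion")
```

Under that one-tailed reading, the moderate ratio exceeds 2 and the extreme ratio exceeds 100, which agrees with the wording above.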
Due to the inherent complexity of typical multinomial data, it is necessary to study several estimates for each distributional characteristic. As Elashoff and Elashoff (1978), discussing estimates of tail weight, note: "No single parameter can summarize the varied meanings of tail length." The same is true of symmetry or its lack (Hill and Dixon, 1982; Gastwirth, 1971). The recent trend toward extensive exploratory data analysis further emphasizes this fact. Therefore, in addition to skewness and kurtosis, the following estimates of symmetry and tail weight from Micceri (1989, p. 158) are computed:
Three measures of symmetry/asymmetry
(1) M/M intervals (Hill and Dixon, 1982), defined as the mean/median interval divided by a robust scale estimate (1.4807 times one-half the interquartile range),
(2) skewness, and
(3) Hogg's (1974) Q2, where:
Q2 = [U(05) - M(25)] / [M(25) - L(05)]
where U(alpha) [M(alpha), L(alpha)] is the mean of the upper [middle, lower] (N+1)alpha observations. The inverse of this ratio defines Q2 for the lower tail (designated by a minus sign).
Note that among estimates of asymmetry, Q2 is sensitive to densities in the distant tail, that the third moment skewness estimate is sensitive to one extremely long tail, and that the standardized mean/median distance is sensitive to asymmetry anywhere in the distribution. The third moment skewness estimate tends to underestimate asymmetry, and underestimate it to a greater degree as the level of asymmetry increases (Micceri, 1989).
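The three symmetry measures can be sketched in Python as follows. This is an illustration under stated assumptions, not the spreadsheet's own formulas: (N+1)alpha is truncated to a whole number of observations, the "middle" observations are taken as a block centred in the ordered data, and the quartiles come from Python's standard library rather than the class-interval percentiles that DATA.DEF uses. All function names are hypothetical, and the signed lower-tail convention for Q2 is not implemented.

```python
import statistics

ROBUST_SCALE_FACTOR = 1.4807  # constant quoted in the text

def section_mean(sorted_x, alpha, part):
    """Mean of the upper, middle or lower (N+1)*alpha observations."""
    n = len(sorted_x)
    # Assumption: truncate (N+1)*alpha to a whole number of observations.
    k = min(n, max(1, int((n + 1) * alpha)))
    if part == "upper":
        chunk = sorted_x[n - k:]
    elif part == "lower":
        chunk = sorted_x[:k]
    else:  # "middle": k observations centred in the ordered data
        start = (n - k) // 2
        chunk = sorted_x[start:start + k]
    return statistics.fmean(chunk)

def mm_interval(x):
    """Mean/median interval divided by 1.4807 times one-half the IQR."""
    # Assumption: library quartiles, not class-interval percentiles.
    q1, _, q3 = statistics.quantiles(x, n=4)
    scale = ROBUST_SCALE_FACTOR * 0.5 * (q3 - q1)
    return (statistics.fmean(x) - statistics.median(x)) / scale

def skewness(x):
    """Third-moment skewness: m3 / m2**1.5, using population moments."""
    mean = statistics.fmean(x)
    m2 = statistics.fmean([(v - mean) ** 2 for v in x])
    m3 = statistics.fmean([(v - mean) ** 3 for v in x])
    return m3 / m2 ** 1.5

def hogg_q2(x):
    """Q2 = [U(05) - M(25)] / [M(25) - L(05)], upper-tail form only."""
    s = sorted(x)
    m25 = section_mean(s, 0.25, "middle")
    upper_gap = section_mean(s, 0.05, "upper") - m25
    lower_gap = m25 - section_mean(s, 0.05, "lower")
    return upper_gap / lower_gap  # fails if the lower gap is zero

symmetric = list(range(1, 41))
right_skewed = [1] * 10 + [2] * 5 + [3] * 3 + [10] * 2
for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    print(name, mm_interval(data), skewness(data), hogg_q2(data))
```

For a perfectly symmetric sample the first two measures are zero and Q2 equals one. A long upper tail pushes all three upward.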
Two different types of tail-weight measure are also computed:
(1) Hogg's (1974) Q and Q1, where:
Q = [U(05) - L(05)] / [U(50) - L(50)]
Q1 = [U(20) - L(20)] / [U(50) - L(50)]
(2) C ratios of Elashoff and Elashoff (1978) - C90, C95 and C975 (the ratio of the 90th, 95th and 97.5th percentile points, respectively, to the 75th percentile point).
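The two kinds of tail-weight measure can be sketched in Python as follows. This is an illustration, not the spreadsheet's own formulas. Assumptions: (N+1)alpha is truncated to a whole number of observations; percentile points for the C ratios are measured as distances from the median, which the text does not state; and percentiles come from Python's standard library rather than from class intervals. All function names are hypothetical.

```python
import statistics
from statistics import NormalDist

def section_mean(sorted_x, alpha, upper):
    """Mean of the upper (or lower) (N+1)*alpha observations."""
    n = len(sorted_x)
    # Assumption: truncate (N+1)*alpha to a whole number of observations.
    k = min(n, max(1, int((n + 1) * alpha)))
    chunk = sorted_x[n - k:] if upper else sorted_x[:k]
    return statistics.fmean(chunk)

def hogg_q(x, alpha=0.05):
    """Q (alpha=0.05) or Q1 (alpha=0.20): [U(alpha) - L(alpha)] / [U(50) - L(50)]."""
    s = sorted(x)
    half_gap = section_mean(s, 0.50, True) - section_mean(s, 0.50, False)
    tail_gap = section_mean(s, alpha, True) - section_mean(s, alpha, False)
    return tail_gap / half_gap

def c_ratios(x):
    """C90, C95 and C975 as percentile distances from the median (assumption)."""
    cuts = statistics.quantiles(x, n=40)  # 39 cut points, one every 2.5%
    median, p75 = cuts[19], cuts[29]
    points = {"C90": cuts[35], "C95": cuts[37], "C975": cuts[38]}
    return {name: (p - median) / (p75 - median) for name, p in points.items()}

def gaussian_c_ratios():
    """Reference C ratios for a standard Gaussian distribution."""
    z = NormalDist().inv_cdf
    return {"C90": z(0.90) / z(0.75),
            "C95": z(0.95) / z(0.75),
            "C975": z(0.975) / z(0.75)}

data = list(range(1, 41))  # evenly spaced scores, roughly uniform
print("Q :", hogg_q(data))
print("Q1:", hogg_q(data, alpha=0.20))
print("C ratios (sample):  ", c_ratios(data))
print("C ratios (Gaussian):", gaussian_c_ratios())
```

Comparing a sample's values with the Gaussian reference shows whether its tails are lighter or heavier than Gaussian on each measure.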
The Q statistics are sensitive to relative density, and the C statistics to distance (between percentiles). Note that percentiles are computed within class intervals, assuming discrete rather than continuous data. No other readily available packaged statistical application of which I am aware computes percentiles in this way, although it is the only appropriate one. Please write if you desire a copy of the paper describing the rationale and method (Micceri, 1988).
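To illustrate what computing a percentile within a class interval can mean, here is a sketch of the common textbook interpolation for grouped data. It is not taken from Micceri (1988) and may differ from the method DATA.DEF uses. It assumes each score value v occupies the interval from v - width/2 to v + width/2. The function name is hypothetical.

```python
from collections import Counter

def grouped_percentile(scores, p, width=1.0):
    """Percentile (0 <= p <= 1) interpolated within the class interval of a score.

    Textbook formula: lower limit + (p*N - cumulative count below) / frequency * width.
    """
    counts = Counter(scores)
    n = len(scores)
    target = p * n
    below = 0  # number of observations in the intervals already passed
    for value in sorted(counts):
        freq = counts[value]
        if below + freq >= target:
            lower_limit = value - width / 2
            return lower_limit + (target - below) / freq * width
        below += freq
    raise ValueError("p must lie between 0 and 1")

lumpy = [1, 1, 1, 2, 3]
print(grouped_percentile(lumpy, 0.50))  # median interpolated inside the interval for score 1
```

For lumpy data such as the example, the interpolated median lies inside the interval of the most frequent score and does not simply equal that score.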
REFERENCES:
Elashoff, J. D. and Elashoff, R. M. (1978). Effects of errors in statistical assumptions. In W. H. Kruskal and J. M. Tanur (Eds.) International encyclopedia of statistics (pp. 229-250). New York: The Free Press.
Gastwirth, J. L. (1971). On the sign test for symmetry. Journal of the American Statistical Association, 66, 821-823.
Hill, M. and Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.
Hogg, R. V. (1974). Adaptive robust procedures: A partial review and some suggestions for future applications and theory. Journal of the American Statistical Association, 69, 909-927.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166.
Micceri, T. (1988). Discrete, Lumpy Data; The Median, and Dislocated Computer Algorithms. Paper presented at the American Statistical Association Conference, San Antonio, TX, January.